Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to create a binary variable that identify whether another variable has duplicates in the dataset using Stata?*

    The dataset is laid out in longitudinal format where the patient id (pid) is the first column, and the second column "PE"is if
    the patient received duplicated physical check-up during their hospital stay. I want to create a new variable "RE" that can achieve
    the goal like this: if PE has duplicates, then RE=1; If not, RE=0.

    * Example generated by -dataex-.
    clear
    input byte (pid PE)
    1 1
    1 2
    1 3
    1 4
    2 1
    2 2
    2 3
    2 3
    3 1
    3 1
    3 2
    3 3
    4 1
    4 1
    4 2
    4 3
    end
    Thank you for your help!

  • #2
    Code:
    bysort pid (PE) : gen RE = sum(PE==PE[_n-1])
    bysort pid (PE) : replace RE = RE[_N] > 0

    Comment


    • #3
      Originally posted by Bjarte Aagnes View Post
      Code:
      bysort pid (PE) : gen RE = sum(PE==PE[_n-1])
      bysort pid (PE) : replace RE = RE[_N] > 0
      Thank you! Could you please explain your code?

      Comment


      • #4
        The code counts duplicates (within patients) of any value as after sorting a duplicate will show up as being equal to the previous value. If your example, the duplicates are in following observations, but the code doesn't assume that.

        The code then reduces that to 1 for duplicates exist and 0 otherwise.

        Comment


        • #5
          The concepts are explained in [D] by -- Repeat Stata command on subsets of the data, and Cox, N. J. 2002. Speaking Stata: How to move step by: step. Stata Journal 2: 86–102.

          Adding a longer more detailed explanation, hopefully not introducing errors...

          First,
          the expression gen RE = sum(PE==PE[_n-1]) is calculated for each by group pid on data sorted within by group by PE.

          Thus, within each pid with data sorted by PE:
          The system variables _n and _N represent the number of the current observation (_n) and the total number of observations (_N) observations in the current by group.

          The relational expression PE==PE[_n-1] will return 0 if false and 1 if true. Here 0 if the current observation of PE is different from the previous observations of PE. The expression will be true and return the value 1 for any record where the current observation of PE is the same as the previous observation.

          The function sum() return the running/cumulative sum over observations of the expression (PE==PE[_n-1]) . Thus, for the last observation within by group pid the running sum will be 0 if no duplicates where found.
          Second,
          bysort pid (PE) : replace RE = RE[_N] > 0 replace RE within by group with 1 or 0 depending on the expression RE[_N] > 0 which will be false returning 0 if no duplicates where found.
          The following code may help illustrate some of the above:
          Code:
          bysort pid (PE) : gen n = _n // current obs within by group
          bysort pid (PE) : gen N = _N // last obs within by group
          bysort pid (PE) : gen isN = (_n==_N) // true of last obs within bygroup
          
          bysort pid (PE) : gen byte PEeqlagPE = ( PE==PE[_n-1] ) // true of PE equals previous PE within bygroup 
          bysort pid (PE) : gen byte sumPEeqlagPE = sum(PEeqlagPE)  // cumulative/running sum of PEeqlagPE within bygroup 
          
          bysort pid (PE) : gen pidhasdups = sumPEeqlagPE[_N] > 0 // true if sumPEeqlagPE[_N] > 0 within bygroup    
          
          
          list pid PE n-pidhasdups ,  sepby(pid) compress abbrev(32)

          Comment


          • #6
            Originally posted by Bjarte Aagnes View Post
            The concepts are explained in [D] by -- Repeat Stata command on subsets of the data, and Cox, N. J. 2002. Speaking Stata: How to move step by: step. Stata Journal 2: 86–102.

            Adding a longer more detailed explanation, hopefully not introducing errors...

            First,
            the expression gen RE = sum(PE==PE[_n-1]) is calculated for each by group pid on data sorted within by group by PE.

            Thus, within each pid with data sorted by PE:
            The system variables _n and _N represent the number of the current observation (_n) and the total number of observations (_N) observations in the current by group.

            The relational expression PE==PE[_n-1] will return 0 if false and 1 if true. Here 0 if the current observation of PE is different from the previous observations of PE. The expression will be true and return the value 1 for any record where the current observation of PE is the same as the previous observation.

            The function sum() return the running/cumulative sum over observations of the expression (PE==PE[_n-1]) . Thus, for the last observation within by group pid the running sum will be 0 if no duplicates where found.
            Second,
            bysort pid (PE) : replace RE = RE[_N] > 0 replace RE within by group with 1 or 0 depending on the expression RE[_N] > 0 which will be false returning 0 if no duplicates where found.
            The following code may help illustrate some of the above:
            Code:
            bysort pid (PE) : gen n = _n // current obs within by group
            bysort pid (PE) : gen N = _N // last obs within by group
            bysort pid (PE) : gen isN = (_n==_N) // true of last obs within bygroup
            
            bysort pid (PE) : gen byte PEeqlagPE = ( PE==PE[_n-1] ) // true of PE equals previous PE within bygroup
            bysort pid (PE) : gen byte sumPEeqlagPE = sum(PEeqlagPE) // cumulative/running sum of PEeqlagPE within bygroup
            
            bysort pid (PE) : gen pidhasdups = sumPEeqlagPE[_N] > 0 // true if sumPEeqlagPE[_N] > 0 within bygroup
            
            
            list pid PE n-pidhasdups , sepby(pid) compress abbrev(32)
            Thank you!

            Comment

            Working...
            X